Julia Kaznowska
Warsaw University of Technology
Faculty of Mathematics and Information Science
WB XAI tabular
import numpy as np
import pandas as pd
import pickle
import dalex
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
import warnings
warnings.filterwarnings("ignore")
pd.options.mode.chained_assignment = None
I am using a trained model from Kaggle (https://www.kaggle.com/code/ravichaubey1506/end-to-end-machine-learning/notebook). I needed help with this part of the homework, so I must give credit to Dawid Płudowski for showing me how to import the model into my report.
# code necessary to load the model from pickle
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
model = pickle.load(open('full_model.pkl', 'rb'))
train_data = pickle.load(open('test_dataset.pkl', 'rb'))
X = train_data.drop(columns=['median_house_value'])
y = train_data['median_house_value']
X.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | income_cat |
|---|---|---|---|---|---|---|---|---|---|---|
| 5241 | -118.39 | 34.12 | 29.0 | 6447.0 | 1012.0 | 2184.0 | 960.0 | 8.2816 | <1H OCEAN | 5 |
| 10970 | -117.86 | 33.77 | 39.0 | 4159.0 | 655.0 | 1669.0 | 651.0 | 4.6111 | <1H OCEAN | 4 |
| 20351 | -119.05 | 34.21 | 27.0 | 4357.0 | 926.0 | 2110.0 | 876.0 | 3.0119 | <1H OCEAN | 3 |
| 6568 | -118.15 | 34.20 | 52.0 | 1786.0 | 306.0 | 1018.0 | 322.0 | 4.1518 | INLAND | 3 |
| 13285 | -117.68 | 34.07 | 32.0 | 1775.0 | 314.0 | 1067.0 | 302.0 | 4.0375 | INLAND | 3 |
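The `CombinedAttributesAdder` above appends ratio features as extra columns. A minimal sanity check on a toy row (columns 3–6 correspond to total_rooms, total_bedrooms, population and households; the numbers are made up for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

# toy row: 100 rooms, 20 bedrooms, 300 people, 50 households
row = np.array([[0.0, 0.0, 0.0, 100.0, 20.0, 300.0, 50.0]])
out = CombinedAttributesAdder().transform(row)
# the three appended columns: rooms/household, population/household, bedrooms/rooms
print(out[0, 7:])
```

The transformer adds three columns, so an input with 7 features comes out with 10, which is what the pickled pipeline expects downstream.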
explainer_rf = dalex.Explainer(model, X, y, label = "housing RF model")
Preparation of a new explainer is initiated
  -> data              : 4128 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 4128 values
  -> model_class       : sklearn.ensemble._forest.RandomForestRegressor (default)
  -> label             : housing RF model
  -> predict function  : <function yhat_default at 0x0000023160160CA0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 4.95e+04, mean = 2.08e+05, max = 5e+05
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -2.49e+05, mean = -1.66e+03, max = 2.94e+05
  -> model_info        : package sklearn

A new explainer has been created!
obs = X.iloc[[2137, 31]]
obs
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | income_cat |
|---|---|---|---|---|---|---|---|---|---|---|
| 8591 | -118.36 | 33.89 | 40.0 | 756.0 | 122.0 | 371.0 | 130.0 | 5.0299 | <1H OCEAN | 4 |
| 15970 | -122.41 | 37.71 | 28.0 | 5015.0 | 1240.0 | 3900.0 | 1029.0 | 1.2269 | NEAR BAY | 1 |
result = model.predict(obs)
result
array([314796.7, 176880. ])
Out of curiosity, I wanted to check whether the predicted values were anywhere near the real ones.
y.iloc[[2137, 31]]
8591     329200.0
15970    181900.0
Name: median_house_value, dtype: float64
The first predicted value is about 4% lower than the real one, the second about 3% lower. These are not bad predictions, at least for these two records.
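These percentages can be reproduced directly from the numbers above:

```python
import numpy as np

pred = np.array([314796.7, 176880.0])   # model predictions from above
true = np.array([329200.0, 181900.0])   # true median_house_value
rel_err = (true - pred) / true          # relative underestimation
print(np.round(rel_err * 100, 1))       # → [4.4 2.8]
```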
bd1 = explainer_rf.predict_parts(
new_observation = X.iloc[[2137]],
type = "break_down")
bd1
| | variable_name | variable_value | variable | cumulative | contribution | sign | position | label |
|---|---|---|---|---|---|---|---|---|
| 0 | intercept | 1 | intercept | 207980.222400 | 207980.222400 | 1.0 | 11 | housing RF model |
| 1 | population | 371.0 | population = 371.0 | 258798.995753 | 50818.773353 | 1.0 | 10 | housing RF model |
| 2 | total_bedrooms | 122.0 | total_bedrooms = 122.0 | 272007.479974 | 13208.484222 | 1.0 | 9 | housing RF model |
| 3 | income_cat | 4.0 | income_cat = 4.0 | 283983.091457 | 11975.611483 | 1.0 | 8 | housing RF model |
| 4 | ocean_proximity | <1H OCEAN | ocean_proximity = <1H OCEAN | 297650.852374 | 13667.760917 | 1.0 | 7 | housing RF model |
| 5 | latitude | 33.89 | latitude = 33.89 | 323291.432752 | 25640.580378 | 1.0 | 6 | housing RF model |
| 6 | median_income | 5.03 | median_income = 5.03 | 336336.675832 | 13045.243080 | 1.0 | 5 | housing RF model |
| 7 | longitude | -118.4 | longitude = -118.4 | 351651.423474 | 15314.747642 | 1.0 | 4 | housing RF model |
| 8 | housing_median_age | 40.0 | housing_median_age = 40.0 | 386909.360401 | 35257.936927 | 1.0 | 3 | housing RF model |
| 9 | households | 130.0 | households = 130.0 | 365551.219776 | -21358.140625 | -1.0 | 2 | housing RF model |
| 10 | total_rooms | 756.0 | total_rooms = 756.0 | 314796.700000 | -50754.519776 | -1.0 | 1 | housing RF model |
| 11 | prediction | | prediction | 314796.700000 | 314796.700000 | 1.0 | 0 | housing RF model |
bd1.plot()
We can see that population has the biggest positive contribution in the break-down, whereas total_rooms has the biggest negative one. In terms of magnitude, they nearly cancel each other out.
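Break-down contributions are additive by construction: the intercept plus all variable contributions reproduces the model's prediction exactly. A quick check, hard-coding the values from the table above:

```python
import numpy as np

intercept = 207980.222400
contributions = np.array([
    50818.773353,   # population
    13208.484222,   # total_bedrooms
    11975.611483,   # income_cat
    13667.760917,   # ocean_proximity
    25640.580378,   # latitude
    13045.243080,   # median_income
    15314.747642,   # longitude
    35257.936927,   # housing_median_age
    -21358.140625,  # households
    -50754.519776,  # total_rooms
])
prediction = intercept + contributions.sum()
print(round(prediction, 1))  # → 314796.7, the model's prediction for this record
```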
bd2 = explainer_rf.predict_parts(
new_observation = X.iloc[[31]],
type = "break_down")
bd2
| | variable_name | variable_value | variable | cumulative | contribution | sign | position | label |
|---|---|---|---|---|---|---|---|---|
| 0 | intercept | 1 | intercept | 207980.222400 | 207980.222400 | 1.0 | 11 | housing RF model |
| 1 | households | 1029.0 | households = 1029.0 | 256080.221786 | 48099.999386 | 1.0 | 10 | housing RF model |
| 2 | total_rooms | 5015.0 | total_rooms = 5015.0 | 263591.643451 | 7511.421665 | 1.0 | 9 | housing RF model |
| 3 | longitude | -122.4 | longitude = -122.4 | 304566.870874 | 40975.227422 | 1.0 | 8 | housing RF model |
| 4 | ocean_proximity | NEAR BAY | ocean_proximity = NEAR BAY | 321394.812355 | 16827.941481 | 1.0 | 7 | housing RF model |
| 5 | total_bedrooms | 1240.0 | total_bedrooms = 1240.0 | 329175.071302 | 7780.258947 | 1.0 | 6 | housing RF model |
| 6 | latitude | 37.71 | latitude = 37.71 | 329283.471673 | 108.400371 | 1.0 | 5 | housing RF model |
| 7 | housing_median_age | 28.0 | housing_median_age = 28.0 | 345222.651768 | 15939.180095 | 1.0 | 4 | housing RF model |
| 8 | income_cat | 1.0 | income_cat = 1.0 | 322739.511588 | -22483.140181 | -1.0 | 3 | housing RF model |
| 9 | population | 3900.0 | population = 3900.0 | 231292.674685 | -91446.836902 | -1.0 | 2 | housing RF model |
| 10 | median_income | 1.227 | median_income = 1.227 | 176880.000000 | -54412.674685 | -1.0 | 1 | housing RF model |
| 11 | prediction | | prediction | 176880.000000 | 176880.000000 | 1.0 | 0 | housing RF model |
bd2.plot()
We can see that for the second record the break-down plot is completely different. population has the biggest contribution of all the variables, and it is negative, in contrast to our first record. The biggest positive contribution, from households, is significantly smaller in magnitude than the negative one.
shap1 = explainer_rf.predict_parts(
new_observation = X.iloc[[2137]],
type = "shap")
shap1
| | variable | contribution | variable_name | variable_value | sign | label | B |
|---|---|---|---|---|---|---|---|
| 0 | median_income = 5.03 | 17372.107994 | median_income | 5.0299 | 1.0 | housing RF model | 1 |
| 1 | total_bedrooms = 122.0 | 14797.704215 | total_bedrooms | 122.0 | 1.0 | housing RF model | 1 |
| 2 | housing_median_age = 40.0 | 14371.580394 | housing_median_age | 40.0 | 1.0 | housing RF model | 1 |
| 3 | income_cat = 4.0 | 7003.468112 | income_cat | 4 | 1.0 | housing RF model | 1 |
| 4 | population = 371.0 | 54319.644065 | population | 371.0 | 1.0 | housing RF model | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 5 | ocean_proximity = <1H OCEAN | 17453.004906 | ocean_proximity | <1H OCEAN | 1.0 | housing RF model | 0 |
| 6 | longitude = -118.4 | 14728.346297 | longitude | -118.36 | 1.0 | housing RF model | 0 |
| 7 | latitude = 33.89 | 13620.265255 | latitude | 33.89 | 1.0 | housing RF model | 0 |
| 8 | housing_median_age = 40.0 | 11166.086539 | housing_median_age | 40.0 | 1.0 | housing RF model | 0 |
| 9 | total_bedrooms = 122.0 | 7267.071800 | total_bedrooms | 122.0 | 1.0 | housing RF model | 0 |
260 rows × 7 columns
shap1.plot()
We can see that for the first observation the break-down and Shapley plots look similar. Every variable has the same sign of impact (positive/negative) in both, and the top variables are the same.
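The similarity is no coincidence: dalex computes SHAP values by averaging break-down contributions over B random variable orderings (hence the B column in the table above). A self-contained toy illustration with two features, exhaustive orderings, and a hypothetical model f:

```python
from itertools import permutations

def f(x1, x2):
    # hypothetical toy model with an interaction term
    return x1 * x2 + x1

baseline = (0.0, 0.0)   # reference observation (stands in for the data average)
obs = (1.0, 2.0)        # observation to explain

phi = {0: 0.0, 1: 0.0}
orders = list(permutations([0, 1]))
for order in orders:
    current = list(baseline)
    prev = f(*current)
    for i in order:
        # switch feature i from its baseline value to the observed value
        current[i] = obs[i]
        new = f(*current)
        phi[i] += new - prev
        prev = new
# average the contributions over all orderings -> Shapley values
phi = {i: v / len(orders) for i, v in phi.items()}
print(phi)  # → {0: 2.0, 1: 1.0}; they sum to f(obs) - f(baseline) = 3.0
```

For two features both orderings can be enumerated; dalex samples B orderings instead because the number of permutations explodes with the number of variables.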
shap2 = explainer_rf.predict_parts(
new_observation = X.iloc[[31]],
type = "shap")
shap2
| | variable | contribution | variable_name | variable_value | sign | label | B |
|---|---|---|---|---|---|---|---|
| 0 | households = 1029.0 | 48099.999386 | households | 1029.0 | 1.0 | housing RF model | 1 |
| 1 | longitude = -122.4 | 35534.741877 | longitude | -122.41 | 1.0 | housing RF model | 1 |
| 2 | population = 3900.0 | -72229.004901 | population | 3900.0 | -1.0 | housing RF model | 1 |
| 3 | income_cat = 1.0 | -18947.995373 | income_cat | 1 | -1.0 | housing RF model | 1 |
| 4 | total_bedrooms = 1240.0 | 2007.653222 | total_bedrooms | 1240.0 | 1.0 | housing RF model | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 5 | income_cat = 1.0 | -19831.217657 | income_cat | 1 | -1.0 | housing RF model | 0 |
| 6 | total_rooms = 5015.0 | -1523.630225 | total_rooms | 5015.0 | -1.0 | housing RF model | 0 |
| 7 | housing_median_age = 28.0 | 1225.994401 | housing_median_age | 28.0 | 1.0 | housing RF model | 0 |
| 8 | total_bedrooms = 1240.0 | -1140.753670 | total_bedrooms | 1240.0 | -1.0 | housing RF model | 0 |
| 9 | latitude = 37.71 | -438.521995 | latitude | 37.71 | -1.0 | housing RF model | 0 |
260 rows × 7 columns
shap2.plot()
A different situation appears for the second observation. The latitude variable has a positive impact in the break-down but a negative one in the Shapley values. This suggests it is probably correlated with another variable in the model.
We can also see that the top-impacting variables differ. Although population is the largest negative contribution in both the break-down and the Shapley values, the largest positive one differs: in the Shapley values it is longitude.
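Such sign flips between break-down and SHAP typically arise from interactions: break-down uses a single fixed ordering of the variables, so with an interaction term the credit a variable receives depends on when it enters. A hypothetical two-feature model makes this visible:

```python
def f(x1, x2):
    # hypothetical model consisting only of an interaction term
    return x1 * x2

baseline, obs = (0.0, 0.0), (1.0, 2.0)

def breakdown(order):
    """Break-down contributions for one fixed variable ordering."""
    current = list(baseline)
    contrib, prev = {}, f(*baseline)
    for i in order:
        current[i] = obs[i]
        contrib[i] = f(*current) - prev
        prev = f(*current)
    return contrib

# whichever variable enters second gets all the credit for the interaction
print(breakdown((0, 1)))  # x1 enters first → {0: 0.0, 1: 2.0}
print(breakdown((1, 0)))  # x2 enters first → {1: 0.0, 0: 2.0}
```

Averaging over both orderings (which is what SHAP does) would split the credit evenly, so a variable whose break-down contribution came mostly from one lucky ordering can change sign in the SHAP attribution.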
Unfortunately, I did not find out how to add boxplots to the Shapley plots, which would make it easier to spot potentially correlated variables. Based on the second observation and latitude, we can nevertheless conclude that the geographic coordinates might be connected to each other. Correlated variables are worth further investigation.
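On the loaded data, `X[['longitude', 'latitude']].corr()` would answer this directly. As a self-contained sketch, the same check on synthetic data mimicking a coastline-shaped point cloud (all numbers below are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic stand-in: points scattered along a SE-to-NW line,
# roughly like the California coastline
latitude = rng.uniform(33, 41, size=500)
longitude = -0.9 * latitude - 88 + rng.normal(0, 0.5, size=500)
coords = pd.DataFrame({"longitude": longitude, "latitude": latitude})
print(coords.corr())  # strong negative correlation between the coordinates
```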